[Wasm RyuJIT] Initial writeup on the calling convention #122988

AndyAyersMS · 2026-01-07T20:07:05Z

Describe the calling convention used by R2R (and perhaps someday jitted) Wasm code.

AndyAyersMS · 2026-01-07T20:08:57Z

PTAL @dotnet/wasm-contrib... this is a first draft so please comment / help fix.

Copilot

Pull request overview

This PR adds comprehensive documentation for the WebAssembly calling convention used by R2R (Ready-to-Run) and JIT-compiled managed code in the CLR. The documentation describes how the runtime interoperates with WebAssembly's stack model, calling sequences, and garbage collection integration.

Key Changes:

Adds a new "Web Assembly ABI (R2R and JIT)" section to the CLR ABI documentation
Documents stack layout, argument passing conventions, prolog/epilog behavior, and calling sequences
Explains GC reference handling at call sites and the portable entry point mechanism

docs/design/coreclr/botr/clr-abi.md

kg · 2026-01-07T23:36:35Z

This is probably a good time to raise that I think we shouldn't pass the stack pointer in an argument.
Reasons for passing it in an argument:

get_global 0 is 6 bytes when linking because the relocated index needs to be 5 bytes so it can be patched by the linker, IIRC. (We're not linking, though!)
Loading a local might be faster than loading a global in wasm (I don't know how to verify this though)

Reasons against it:

Code size goes up because we need to copy the stack pointer into and out of the global at very many locations
More room for bugs caused by the stack pointer getting out of sync with the local
The extra argument makes it more likely that actual arguments won't occupy argument registers once our wasm is jitted/aot'd

Given that we're not linking I view the sp argument thing as a Potential Optimization and I feel like it's premature. Single or Yowl may have a good reason why we should still do it though based on their experiences.

I believe what we would do instead is export the stack_pointer from the runtime's wasm module (this may already be the default behavior for emscripten, to enable dynamic linking?) then grab the WebAssembly.Global for the stack_pointer and import it into every r2r module we load. Then all our code can manipulate the linear stack the same way clang generated code does.
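As a rough illustration of the 2-byte versus 6-byte stack pointer access sizes mentioned in this comment, the sketch below encodes a global.get with a compact LEB128 index and with the 5-byte padded form that relocatable object files use so the linker can patch the index in place. The helper names `uleb128` and `global_get` are made up for this example; the opcode value 0x23 is the standard encoding of global.get.

```python
def uleb128(value: int, pad_to: int = 0) -> bytes:
    """Encode a non-negative integer as unsigned LEB128, optionally padded to pad_to bytes."""
    if value < 0:
        raise ValueError("ULEB128 encodes non-negative integers only")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        # Keep emitting continuation bytes until the value is exhausted
        # and the requested padding width has been reached.
        more = value != 0 or len(out) + 1 < pad_to
        out.append(byte | (0x80 if more else 0))
        if not more:
            return bytes(out)


GLOBAL_GET = 0x23  # opcode byte for global.get


def global_get(index: int, relocatable: bool) -> bytes:
    """Hypothetical helper: bytes for `global.get index`."""
    # Relocatable code pads the index to 5 bytes; final code can use the short form.
    return bytes([GLOBAL_GET]) + uleb128(index, pad_to=5 if relocatable else 0)


compact = global_get(0, relocatable=False)
padded = global_get(0, relocatable=True)
print(len(compact), compact.hex())
print(len(padded), padded.hex())
```

Indices below 128 fit in one LEB128 byte, which is what "a small index" amounts to in the size estimates that follow.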

AndyAyersMS · 2026-01-08T00:14:06Z

we shouldn't pass the stack pointer in an argument.

Following on this, could we also use a small index global to pass the portable entry point? Then we would not need the interpreter->managed stub to invoke the method with an extra unused argument.

SingleAccretion · 2026-01-08T00:31:30Z

Following on this, could we also use a small index global to pass the portable entry point? Then we would not need the interpreter->managed stub to invoke the method with an extra unused argument.

It doesn't seem necessary to optimize for the stub case? Stubs will be a very small proportion of the overall code, and every managed callsite would need to be made at least 2 bytes bigger (for global.set), in addition to the inflexibility of hardcoding global indices.

yowl · 2026-01-08T00:36:20Z

How will a global.set work when threads are a thing?

kg · 2026-01-08T00:39:11Z

How will a global.set work when threads are a thing?

Aren't globals functionally thread-local in wasm? Is it different for wasi?

SingleAccretion · 2026-01-08T00:59:45Z

Aren't globals functionally thread-local in wasm? Is it different for wasi?

I think @yowl is referring to the complexities of this: https://github.com/WebAssembly/shared-everything-threads/blob/main/proposals/shared-everything-threads/Overview.md#thread-local-storage-tls.

Without instance-per-thread, things get a bit expensive for globals. I don't know what the current expected implementation strategy is. I know it (__stack_pointer access cost) was brought up in the shared-everything proposal evolution history and they discussed things like giving a special TLS slot just for the stack pointer. This would need more research.

Edit: found some discussion about this - https://github.com/WebAssembly/meetings/blob/ca764085f4ac4c750b0500d9f2b7e1648636f503/threads/2025/THREADS-03-04.md. This also reminds us that an imported global requires two indirections instead of one.

Another edit: more - https://github.com/WebAssembly/meetings/blob/ca764085f4ac4c750b0500d9f2b7e1648636f503/threads/2024/THREADS-07-09.md.

kg · 2026-01-08T01:22:35Z

Aren't globals functionally thread-local in wasm? Is it different for wasi?

I think @yowl is referring to the complexities of this: https://github.com/WebAssembly/shared-everything-threads/blob/main/proposals/shared-everything-threads/Overview.md#thread-local-storage-tls.

Without instance-per-thread, things get a bit expensive for globals. I don't know what the current expected implementation strategy is. I know it (__stack_pointer access cost) was brought up in the shared-everything proposal evolution history and they discussed things like giving a special TLS slot just for the stack pointer. This would need more research.

Edit: found some discussion about this - https://github.com/WebAssembly/meetings/blob/ca764085f4ac4c750b0500d9f2b7e1648636f503/threads/2025/THREADS-03-04.md. This also reminds us that an imported global requires two indirections instead of one.

This also appears to mention that importing a global turns it into two loads instead of one, which is a bit of a problem. But I think we have to import sp no matter what and it's just a question of how often we're going to touch it.

dotnet-policy-service · 2026-01-08T10:02:58Z

Tagging subscribers to 'arch-wasm': @lewing, @pavelsavara
See info in area-owners.md if you want to be subscribed.

AndyAyersMS · 2026-01-09T01:09:37Z

Some crude estimates of the size costs of an SP arg vs maintaining the global SP.

Global SP always in sync

;; PROLOG

global.get $__stack_pointer    ;; (2, if can get a small index), else 6.
i32.const FRAMESIZE            ;; (2 typically)
i32.sub                        ;; 1
dup                            ;; 1
global.set $__stack_pointer    ;; (2/6)
local.set sp                   ;; 2

;; EPILOG

local.get sp                   ;; 2
i32.const FRAMESIZE            ;; 2ish
i32.add                        ;; 1
global.set $__stack_pointer    ;; 2/6

So 10/18 bytes per prolog, 7/11 bytes per epilog

No overhead at call sites. Smaller signatures.
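The per-sequence totals can be checked by summing the annotated instruction sizes. This is only a tally of the numbers written in the listing above (small-index size first, padded-index size second); `dup` is kept as written even though core Wasm has no such instruction, so it stands in for whatever duplicates the value.

```python
# (instruction, bytes with a small global index, bytes with a 5-byte padded index)
PROLOG = [
    ("global.get $__stack_pointer", 2, 6),
    ("i32.const FRAMESIZE", 2, 2),
    ("i32.sub", 1, 1),
    ("dup", 1, 1),  # placeholder, as written in the listing
    ("global.set $__stack_pointer", 2, 6),
    ("local.set sp", 2, 2),
]

EPILOG = [
    ("local.get sp", 2, 2),
    ("i32.const FRAMESIZE", 2, 2),
    ("i32.add", 1, 1),
    ("global.set $__stack_pointer", 2, 6),
]


def sequence_size(sequence):
    """Return (small-index total, padded-index total) for an instruction list."""
    small = sum(entry[1] for entry in sequence)
    padded = sum(entry[2] for entry in sequence)
    return small, padded


prolog_size = sequence_size(PROLOG)
epilog_size = sequence_size(EPILOG)
print("prolog:", prolog_size)
print("epilog:", epilog_size)
```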

Global SP lazy sync at boundaries

;; PROLOG

local.get sp                  ;; 2
i32.const FRAMESIZE           ;; 2
i32.sub                       ;; 1
local.set sp                  ;; 2

;; EPILOG

(empty)

So 7 bytes per prolog, 0 per epilog

;; unmanaged call sites & fcalls

local.get  sp                 ;; 2
global.set $__stack_pointer   ;; 2/6   (~0 amortizable for fcalls)

;; managed call sites

local.get  sp                 ;; 2

For the current x86 crossgen SPMI collection (which may not represent the set of methods we care about) there are 273829 methods / prologs, 312716 epilogs, 1504289 managed call sites, and 163641 helper call sites.

So assuming the optimistic case where we can encode the global SP index in one byte and wrap FCalls with a global SP update (size assumed negligible) I get size estimates like:

global SP   in sync: 5.581M bytes  (~20 bytes/method)
global SP lazy sync: 4.925M bytes  (~18 bytes/method)

If we can't get a small index for the SP global then the "sync" cost rises to 8.3M (~30 bytes/method).

This may overstate the difference somewhat. For leaf methods we generally won't need these SP sequences... I haven't tried to account for those yet.
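A sketch of the arithmetic behind these totals, using only the SPMI counts and per-site byte costs quoted above. The function names are made up. Prolog plus epilog costs alone give about 4.93M bytes for the in-sync scheme with a small index; the quoted 5.581M is matched if 4 bytes per helper call site are added as well. That extra term is an assumption about the accounting, not something stated in the comment.

```python
METHODS = 273_829          # methods / prologs
EPILOGS = 312_716
MANAGED_CALLS = 1_504_289
HELPER_CALLS = 163_641


def in_sync_bytes(prolog, epilog, per_helper_call=0):
    """Global SP kept in sync: cost paid in every prolog and epilog."""
    return METHODS * prolog + EPILOGS * epilog + HELPER_CALLS * per_helper_call


def lazy_bytes(prolog=7, per_managed_call=2):
    """Lazy sync: cheaper prolog, empty epilog, plus local.get sp at managed call sites."""
    return METHODS * prolog + MANAGED_CALLS * per_managed_call


sync_small = in_sync_bytes(10, 7)
# Assumption: charging 4 bytes per helper call site reproduces the quoted figure.
sync_small_with_helpers = in_sync_bytes(10, 7, per_helper_call=4)
sync_large = in_sync_bytes(18, 11)
lazy_total = lazy_bytes()

for label, total in [
    ("in sync, small index", sync_small),
    ("in sync, small index + 4 bytes/helper call (assumed)", sync_small_with_helpers),
    ("in sync, padded index", sync_large),
    ("lazy sync", lazy_total),
]:
    print(f"{label}: {total:,} bytes ({total / METHODS:.1f} bytes/method)")
```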

SingleAccretion · 2026-01-09T01:21:12Z

Some crude estimates of the size costs of an SP arg vs maintaining the global SP.

This matches my calculations based on NAOT-LLVM data from last year almost perfectly: "50%" code size for the large index encoding, about the same for the small encoding.

For leaf methods we generally won't need these SP sequences... I haven't tried to account for those yet.

The proportion of truly leaf methods is going to be rather low. It was on the order of < 5% on real-world NAOT-LLVM data (this is from dotnet/runtimelab#2697). This is because most methods may throw (an NRE), requiring a helper call.

jkotas · 2026-01-09T07:24:37Z

This is because most methods may throw (an NRE), requiring a helper call.

Was this with the assumption that this can be null? It is a gray area. this can never be null in C#. I think it would be ok to assume that this is never null for wasm.

SingleAccretion · 2026-01-09T17:20:53Z

Was this with the assumption that this can be null?

Yes. I can re-measure how this works out with the non-null-this assumption. Though I personally wouldn't support such a thing "for WASM only". And it would be a breaking change for structs (((S*)null)->Method();, which works today, would become UB).

jkotas · 2026-01-09T23:11:44Z

I believe we have some non-deterministic behaviors around null this pointer for reference types today. I would not feel bad about more UB there by default. It is impossible to end up with null this for reference types in C#.

it would be a breaking change for structs (((S*)null)->Method();, which works today

I guess we can keep them for structs.

SingleAccretion · 2026-01-11T21:57:07Z

Single or Yowl may have a good reason why we should still do it though based on their experiences.

I'll write down my thinking on this question. I'll use $sp to represent the argument scheme and $__stack_pointer to represent the global.set scheme.

Code Size

As per Andy's study above (and my earlier investigation as well), the size impact is "weakly in favor" of using $sp. I've also looked into Jan's suggestion above about non-null TYP_REF this, and while it does increase the number of leaf methods (5% -> 8% for my Avalonia sample app), that is not significant enough. That said, I have not investigated what the numbers look like if we allow tail-calling the NRE helper (thus discarding the need for a frame), since this will degrade diagnostics and is not a parity experience with other platforms. I can look into that if needed [but it's less trivial than the experiments so far].

"Weakly in favor" because:

We have neutral impact for scenarios where we can hardcode the __stack_pointer index, i.e. traditional dynamically loaded R2R.
We have positive (significant) impact for scenarios where we can't hardcode __stack_pointer and aren't able to use "shrinking". This is the NAOT scenario, or a potential "fused" R2R scheme where we could combine native code directly with R2R code.

Throughput

This point is clearly in favor of $sp. As we also discussed above, an imported $__stack_pointer is at least two indirections, potentially (significantly?) more with multi-threading in the picture. Since $sp is going to be one of the hottest values in the majority of methods, it will also, on balance, be a good tradeoff to "use up" a register argument slot for it.

Runtime complexity & interop

This point is clearly in favor of $__stack_pointer, since:

We won't need to deal with a class of bugs arising from "SP desync". I don't know how big of an issue that is. My speculation/feeling is that it shouldn't be a big problem.
Simpler FCall implementation. I would however note that depending on whether we can find a clang builtin for "get caller's SP", we would still need manual assembly for those FCalls that need to be stackwalking roots (allocation helpers, potentially EH?).
The managed<->native transitions get a bit faster and smaller (I don't think it is too important to optimize them though).

Based on the above I would personally be weakly in favor of $sp as the convention better reflecting the reality of managed code where most methods will need to use the stack, as opposed to native code where $__stack_pointer is only needed for address-taken structures / local arrays (recall the default WASM stack size is 65K).

AndyAyersMS · 2026-01-12T22:04:36Z

From what I can tell there is no way to have LLVM inline assembly insert instructions before the prolog, so something like

int add(int* sp, int b) {
    __asm("local.get 0\n global.set __stack_pointer\n");
    return b;
}

generates (without opts) code like this per compiler explorer

add(int*, int):
        global.get      __stack_pointer
        i32.const       16
        i32.sub 
        local.set       2
        local.get       2
        local.get       0
        i32.store       12
        local.get       2
        local.get       1
        i32.store       8

        ;; inline asm payload
        local.get       0
        global.set      __stack_pointer

        local.get       2
        i32.load        8
        return
        end_function

So it seems that, with the pure lazy $sp approach, wrappers for helpers implemented in native code must be written in Wasm directly. With wrappers we could still add the extra ignored initial arg to the C++ helpers when building for wasm, which avoids arg shuffling (assuming no wrapped helper returns a struct), though we'd have to adapt existing callers. We could make $sp the last argument instead, but this is less efficient for managed callees.

If we do the lazy sync and can't or don't want to do wrappers, we can reduce the sync cost a bit further by only setting $__stack_pointer in the prolog of methods that make helper calls. There are 0.6 helper calls per method but the fraction of methods with helper call sites is around 0.2.
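A rough comparison of the two lazy-sync placements described above, assuming the sync sequence is `local.get sp` followed by `global.set $__stack_pointer` at 4 bytes with a small global index, and taking 0.2 as the approximate fraction of methods with helper call sites. Both figures are approximations from this thread rather than measurements of real codegen.

```python
METHODS = 273_829
HELPER_CALLS = 163_641

SYNC_BYTES = 4                    # assumed: local.get sp (2) + global.set (2, small index)
FRACTION_WITH_HELPER_CALLS = 0.2  # approximate fraction quoted above

# Option 1: sync the global at every helper call site.
per_call_site_bytes = HELPER_CALLS * SYNC_BYTES

# Option 2: sync once, in the prolog of each method that has any helper call.
methods_with_helper_calls = round(METHODS * FRACTION_WITH_HELPER_CALLS)
per_prolog_bytes = methods_with_helper_calls * SYNC_BYTES

print(f"helper calls per method: {HELPER_CALLS / METHODS:.2f}")
print(f"sync at each helper call site: {per_call_site_bytes:,} bytes")
print(f"sync in prologs only:          {per_prolog_bytes:,} bytes")
```

Under these assumptions the prolog placement costs about a third of the per-call-site placement, because methods that contain helper calls contain about three of them on average.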

pavelsavara · 2026-01-12T22:17:15Z

From what I can tell there is no way to have LLVM inline assembly insert instructions before the prolog, so something like

Do I understand correctly that we only need this for NativeAOT-LLVM?

Maybe it would be OK to create a wrapper function that does the global.set __stack_pointer and then calls the real thing,
and then rely on wasm-opt --one-caller-inline-max-function-size to inline it.

https://manpages.debian.org/testing/binaryen/wasm-opt.1.en.html#one

pavelsavara · 2026-01-12T22:20:25Z

Here is how we create direct wasm with LLVM
https://github.com/dotnet/runtime/blob/main/src/mono/mono/utils/mono-threads-wasm.S

SingleAccretion · 2026-01-12T23:03:19Z

I don't think we need any fancy tooling to create these wrappers. All fcalls are implemented with macros that already have most of the information we need - it will need to be augmented with the ABI types (alternatively, you can play games with compile-time string concatenation using constexpr). Concept:

// Usage
#define FCIMPL_VOID_I(foo, void *p)

// Rough definition
#define FCIMPL_VOID_I(funcname, a1) \
__asm(
  .functype funcname (i32, i32) -> ()
  .global funcname
funcname:
  local.get 0
  global.set __stack_pointer
  local.get 1
  call funcname##_native
  end_function
);

void F_CALL_CONV funcname##_native(a1) { FCIMPL_PROLOG(funcname)

AndyAyersMS · 2026-01-14T00:18:57Z

Given the above, I propose that we go with the lazy $sp approach for now.

For the managed wrappers outlined above, do we need to be careful not to mess things up for the interpreter?

jkotas · 2026-01-14T00:30:55Z

For the managed wrappers outlined above, do we need to be careful not to mess things up for the interpreter?

It should just work. The interpreter will call FCalls like any other method with native code.

Co-authored-by: Jan Kotas

AndyAyersMS · 2026-01-14T21:09:43Z

/ba-g markdown only change (build analysis seems to be confused)

[Wasm RyuJIT] Initial writeup on the calling convention

4b39254

Apply suggestions from code review

cd1ddca

Co-authored-by: Copilot

AaronRobinsonMSFT reviewed Jan 7, 2026

kg reviewed Jan 7, 2026
